ABSTRACT

Search engines have become key media for our scientific, eco- 
nomic, and social activities by enabling people to access in- 
formation on the Web in spite of its size and complexity. On 
the down side, search engines bias the traffic of users accord- 
ing to their page-ranking strategies, and some have argued 
that they create a vicious cycle that amplifies the domi- 
nance of established and already popular sites. We show 
that, contrary to these prior claims and our own intuition, 
the use of search engines actually has an egalitarian effect. 
We reconcile theoretical arguments with empirical evidence 
showing that the combination of retrieval by search engines 
and search behavior by users mitigates the attraction of pop- 
ular pages, directing more traffic toward less popular sites, 
even in comparison to what would be expected from users 
randomly surfing the Web. 

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval; H.3.4 [Information Storage
and Retrieval]: Systems and Software - Information networks;
H.3.5 [Information Storage and Retrieval]: Online
Information Services - Commercial, Web-based services;
H.5.4 [Information Interfaces and Presentation]:
Hypertext/Hypermedia - Navigation, user issues; K.4.m
[Computers and Society]: Miscellaneous

General Terms 

Measurement 

Keywords 

Search engines, bias, popularity, traffic, PageRank, in-degree. 

1. INTRODUCTION

The crucial role of the Web as a communication medium 
and its unsupervised, self-organized development have trig- 
gered the intense interest of the scientific community. The 
topology of the Web as a complex, scale-free network is 
now well characterized 1161 |SJ Q I17| . Several growth
and navigation models have been proposed to explain the 
Web's emergent topological characteristics and their effect 
on users' surfing behavior [SJ [THI HH H]. As
the size and complexity of the Web have increased, users
have become reliant on search engines [19, 20], so that the
paradigm of search is replacing that of navigation as the
main interface between people and the Web [31, 29]. This
leads to questions about the role of search engines in shaping 
the use and evolution of the Web. 

One common belief is that the use of search engines biases
traffic toward popular sites. This is at the origin of the
vicious cycle illustrated in Fig. 1. Pages highly ranked by
search engines are more likely to be discovered and consequently
linked to by other pages. This in turn would further
increase the popularity and raise the average rank of
those pages. As popular pages become more and more popular,
new pages are unlikely to be discovered [9]. Such a
cycle would accelerate the rich-get-richer dynamics already
observed in the Web's network structure and explained by
preferential attachment and link copy models [16, 18].
This presumed phenomenon, also known as search engine
bias, entrenchment effect, or googlearchy, has been widely
discussed in computer, social and political science |14l 1241
H] 1131 U21 12(i| and methods to counteract it are being
proposed [TO! no.

In this paper we use both empirical and theoretical ar- 
guments to show that the bias of search engines is of the 
opposite nature, namely directing more traffic toward less 
popular pages compared to the case in which no search oc- 
curs and all traffic is generated by surfing hyperlinks. Our 
contributions are organized as follows: 

• We develop a simple modeling framework in which one 
can quantify the amount of traffic that Web sites re- 
ceive in the extreme cases in which users browse the 
Web by surfing random hyperlinks and in which users 
only visit pages returned by search engines in response 
to queries. The framework, introduced in Section 2,
allows one to make and compare predictions about how
navigation and search steer traffic and thus bias the 
popularity of Web sites. 
Figure 1: Illustration of search engine bias. A. Page i is "popular" in that it has many incoming links and
high PageRank. A user creates a new page j. B. The user consults a search engine to find pages related to 
j. Since i is ranked highly by the search engine, it has a high probability of being returned to the user. C. 
The user, having discovered i, links to it from j. Thus i becomes even more popular from the search engine's 
perspective. 



• In Section 3 we provide a first empirical study of the
traffic toward Web pages as a function of their in-degree.
This particular relationship is the one that can
directly validate the models in Section 2. As it turns
out, both the surfing and searching models are surprisingly
wrong; the bias in favor of popular pages seems
to be mitigated, rather than enhanced, by the combination
of search engines and users' search behavior.
This result contradicts prior assumptions about search
engine bias.

• The unexpected empirical observation on traffic is explained
in Section 4, where we take into consideration
a previously neglected factor about search results,
namely the distribution and composition of hit set
size. This distribution, determined empirically from
actual user queries, allows one to reconcile the searching
model with the empirical data of Section 3. Using
theoretical arguments and numerical simulations we
show that the search model, revised to take queries into
account, accurately predicts traffic trends, confirming
the egalitarian bias of search engines.

2. MODELING THE VICIOUS CYCLE 

For a quantitative definition of popularity we turn to the 
probability that a generic user clicks on a link leading to a 
specific page [10]. We will also refer to this quantity as the
traffic to the same page. 

2.1 Surfing model of traffic 

In the absence of search engines, people would browse 
Web pages primarily by following hyperlinks. It is natural 
to assume that the amount of such surfing-generated traffic 
directed toward a given page is proportional to the num- 
ber of links k pointing to it. The more the pages pointing to 
that page, the larger the probability that a randomly surfing 
user will discover it. Successful search engines, Google being
the premier example [7], have modeled this effect in their
ranking functions to gauge page importance. The PageRank
value p(i) of page i is defined as the probability that a
random walker moving on the Web graph will visit i next,
thereby estimating the page's discovery probability according
to the global structure of the Web. Experimental observations
and theoretical results show that, with good approximation,
p ~ k (see Appendix A). Therefore, in the surfing
model where users only visit pages by following links, the
traffic through a page is given by t ~ p ~ k.
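The random-walker definition of PageRank can be illustrated with a minimal power-iteration sketch. The toy graph, the function name `pagerank`, and the iteration count are assumptions made for illustration only; the damping factor 0.85 is the value reported in Appendix A. This is not the code used for the WebBase analysis.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank for a graph given as {node: [targets]}."""
    nodes = list(out_links)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation term: the walker jumps to a uniformly random page.
        new_p = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                # The walker follows one of the outgoing links at random.
                share = damping * p[v] / len(targets)
                for w in targets:
                    new_p[w] += share
            else:
                # Dangling page: spread its probability uniformly.
                for w in nodes:
                    new_p[w] += damping * p[v] / n
        p = new_p
    return p


# Hypothetical toy graph: page "a" receives the most links.
toy_graph = {
    "a": ["b"],
    "b": ["a"],
    "c": ["a"],
    "d": ["a", "b"],
}
ranks = pagerank(toy_graph)
in_degree = {v: sum(v in targets for targets in toy_graph.values())
             for v in toy_graph}
```

In this toy example the ordering of pages by PageRank follows the ordering by in-degree, which is the qualitative content of the approximation p ~ k; a four-node graph cannot, of course, test the scaling itself.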

2.2 Searching model of traffic 

When navigation is mediated by search engines, to esti- 
mate the traffic directed toward a page, one must consider 
how search engines retrieve and rank results, as well as how 
people use these results. Following the seminal paper by 
Cho and Roy [9], this means that we need to find two relationships:
(i) how the PageRank translates into the rank of
a result page, and (ii) how the rank of a hit translates into 
the probability that the user clicks on the corresponding link 
thus visiting the page. 

The first step is to determine the scaling relationship be- 
tween PageRank (and equivalently in-degree as discussed 
above) and rank. Search engines employ many factors to 
rank pages. Such factors are typically query-dependent: 
whether the query terms appear in the title or body of a 
page, for example. They also use a global (query-independent) 
importance measure, such as PageRank, to judge the value 
of search hits. If we average across many user queries, we 
expect PageRank to determine the average rank r of each 
page within search results: the page with the largest p has 
average rank r ≈ 1 and so on, in decreasing order of p.

Figure 2: A: Distribution of PageRank p: the log-log plot
shows a power law $\Pr(p) \sim p^{-2.1}$. B: Empirical
relation between rank and PageRank: the log-log plot shows
a power law $r \sim p^{-1.1}$. Both plots are based on data
from a WebBase 2003 crawl [30].

Statistically, r and p have a non-linear relationship. There
is an exact mathematical relationship between the value of a
variable p and the rank of that value, assuming that a set of
measures is described by a normalized histogram (or distribution)
Pr(p). The rank r is essentially the number of measures
greater than p, i.e., $r = N \int_{p}^{p_{max}} \Pr(x)\,dx$,
where $p_{max}$ is the largest measure gathered and N the
number of measures. Empirically we find that the distribution
of PageRank is a power law $p^{-\mu}$ with exponent
$\mu \approx 2.1$ (Fig. 2A). In general, when the variable p is
distributed according to a power law with exponent $-\mu$, and
neglecting large-N corrections, one obtains:

$$r(p) \sim p^{-\beta} \tag{1}$$

where $\beta = \mu - 1 \approx 1.1$. Cho and Roy [9] derived the relation
between p and r differently, by fitting the empirical curve of
rank vs. PageRank obtained from a large WebBase crawl.
Their fit returns a somewhat different value for the exponent
$\beta$ of 3/2. To check this discrepancy we used Cho and Roy's
method and fitted the empirical curve of rank vs. PageRank
from our WebBase sample, confirming our estimate of $\beta$ over
three orders of magnitude (Fig. 2B).
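The relation between the exponent of the distribution and the exponent of the rank curve can be checked numerically. The sketch below is an illustration under stated assumptions, not the procedure used on the WebBase data: it builds a synthetic power-law sample from inverse-CDF quantiles (so it is deterministic), ranks the values, and fits the slope of log(rank) against log(p) by ordinary least squares. The function name, the sample size, and the cutoff `min_rank` are hypothetical choices.

```python
import math


def rank_vs_value_slope(mu=2.1, n=100_000, min_rank=10):
    """Least-squares slope of log(rank) against log(p) for a power-law sample.

    The sample is built from inverse-CDF quantiles of a density
    Pr(p) ~ p^(-mu) on p >= 1, so no random numbers are involved.
    """
    beta = mu - 1.0
    xs, ys = [], []
    for i in range(n):
        u = (i + 0.5) / n        # approximate fraction of measures above p
        p = u ** (-1.0 / beta)   # invert the survival function u = p^(-beta)
        rank = i + 1             # i = 0 is the largest value, i.e. rank 1
        if rank >= min_rank:     # skip the top ranks, where discreteness matters
            xs.append(math.log(p))
            ys.append(math.log(rank))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


slope = rank_vs_value_slope()
```

For mu = 2.1 the fitted slope comes out close to -1.1, consistent with an exponent of mu - 1 for the rank curve.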

The second step, still following ref. [9], is to approximate
the traffic to a given page by the probability that when the
page is returned by a search engine, the user will click on
its link. We expect the traffic t to a page to be a decreasing
function of its rank r. Lempel and Moran [21] reported a
non-linear relation $t \sim r^{-\alpha}$, confirmed by our analysis using
query logs from AltaVista as shown in Fig. 3.
Figure 3: Scaling relationship between click probability
t and hit rank r: the log-log plot shows a
power law $t \sim r^{-1.63}$ (data from a sample of 7 million
queries submitted to AltaVista between September
28 and October 3, 2001).



Note that the rank plotted on the x-axis of Fig. 3 does not
refer exactly to the absolute position of a hit i in the list of
hits, but rather to the rank of the result page where the link
to i appears. Search engines display query results in pages
containing a fixed number of hits (usually 10). Assuming
that each result page contains 10 items, as in the AltaVista
queries we examined, all hits from the first to the tenth will
appear in the first result page and the corresponding click
probabilities will be cumulated, giving the leftmost point in
the plot. The same is done for the hits from the 11th to
the 20th, from the 21st to the 30th, and so on. Lacking
better information, we consider result pages instead of single
hits, implicitly assuming that within each result page the
probability to click on a link is independent of its position.
This assumption is reasonable, although there can still be
a gradient between the top and the bottom hits, as people
usually read the list starting from the top.

The sudden drop near the 21st result page in Fig. 3 is due
to the way AltaVista operated during the summer of 2001,
when they decided to limit the list of results to 200 pages
per query (displayed in 20 result pages). We therefore limited
the analysis to the first 20 data points, which can be
fitted quite well by a simple power law relation between the
probability t that a user clicks on a hit and the rank $r_p$ of
the result page where this hit is displayed:

$$t \sim r_p^{-\alpha} \tag{2}$$

with exponent $\alpha = 1.63 \pm 0.05$. The fit exponent obtained
by Cho and Roy was 3/2, which is close to our estimate.
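One common way to estimate such an exponent is an ordinary least-squares fit of log t against log r. The text does not state which fitting procedure was used, so the sketch below is an assumption, and the click probabilities in it are synthetic, noise-free values generated from the reported exponent rather than the AltaVista data.

```python
import math


def fit_power_law_exponent(ranks, clicks):
    """Return alpha for t ~ r^(-alpha) via least squares in log-log space."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(t) for t in clicks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var  # the slope is -alpha


# Synthetic click probabilities for the first 20 result pages (illustrative only).
result_pages = list(range(1, 21))
synthetic_clicks = [r ** -1.63 for r in result_pages]
alpha_hat = fit_power_law_exponent(result_pages, synthetic_clicks)
```

On noise-free input the fit returns the generating exponent; on real query logs the estimate would carry an uncertainty such as the ± 0.05 quoted above.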

In our calculations we took into account the grouping of
the hits in result pages, consistently with the empirical result
of Fig. 3. However, we noticed that if one replaces in Eq. 2
the rank $r_p$ of the result page with the absolute rank r of the
individual hits, the final results do not change appreciably.
Therefore, to simplify the discussion, we shall assume from
now on that

$$t \sim r^{-\alpha} \tag{3}$$

The rapid decrease of t with the rank r of the hit clearly
indicates that users focus with larger probability on the top
results.

We are now ready to express the traffic as a function of
page in-degree k using the general scaling relation $t \sim k^{\gamma}$.
In the pure surfing model, $\gamma = 1$; in the searching model, we
take advantage of the relations between t and r, between r
and p, and between p and k to obtain

$$t \sim r^{-\alpha} \sim (p^{-\beta})^{-\alpha} = p^{\alpha\beta} \sim k^{\alpha\beta} \tag{4}$$

and therefore $\gamma = \alpha\beta$, ranging between $\gamma \approx 1.8$ (according
to our measures $\alpha \approx 1.63$, $\beta \approx 1.1$) and 2.25 (according to
estimates by others [21, 9]).
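The arithmetic behind these two values, and what a superlinear exponent means in practice, can be spelled out in a few lines. The function names are hypothetical, and the factor-of-ten comparison is an illustrative choice, not a figure from the text.

```python
def traffic_exponent(alpha, beta):
    """Exponent gamma in t ~ k^gamma for the searching model."""
    return alpha * beta


def traffic_ratio(in_degree_ratio, gamma):
    """Predicted traffic ratio between two pages whose in-degrees differ by a factor."""
    return in_degree_ratio ** gamma


gamma_ours = traffic_exponent(1.63, 1.1)   # exponents measured in this study
gamma_others = traffic_exponent(1.5, 1.5)  # both exponents taken as 3/2

# A page with ten times more incoming links than another:
surfing_gain = traffic_ratio(10, 1.0)          # linear surfing model
searching_gain = traffic_ratio(10, gamma_ours)  # superlinear searching model
```

Under the surfing model a tenfold advantage in links yields a tenfold advantage in traffic, whereas with gamma near 1.8 the same advantage in links is amplified considerably.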

In all cases, the searching model leads to a value $\gamma > 1$.
This superlinear behavior implies that the common use of
search engines will bias traffic toward already popular sites.
This is at the origin of the vicious cycle illustrated in Fig. 1.
Pages highly ranked by search engines are more likely to be
discovered (as compared to pure surfing) and consequently
linked to by other pages. This in turn would further increase
their PageRank and raise the average rank of those
pages. Popular pages become more and more popular, while
new pages are unlikely to be discovered [9]. Such a cycle
would accelerate the rich-get-richer dynamics already observed
in the Web's network structure [5, 16, 18]. This presumed
phenomenon has been dubbed search engine bias or
entrenchment effect and has been recently brought to the
attention of the technical Web community 4 9 2(jj, and
methods to counteract it have been proposed. There
are also notable social and political implications to such a
googlearchy |mi24l[T3) .
3. EMPIRICAL DATA

To determine whether such a vicious cycle really exists,
let us consider the empirical data. Given a Web page, its
in-degree is the number of links pointing to it, which can
be easily estimated using a search engine such as Google or
Yahoo [12, 32]. Traffic is the fraction of all user clicks in
some period of time that lead to the page; this quantity,
also known as view popularity [10], can be estimated using
the Alexa Traffic Rankings service, which monitors the sites
viewed by users of its toolbar. We used the Yahoo and
Alexa services to estimate in-degree and traffic for a total of
28,164 Web pages. Of these, 26,124 were randomly selected
using Yahoo's random page service. The remaining 2,040
pages were selected among the sites that Alexa reports as
the ones with highest traffic. The resulting density plot is
shown in Fig. 4A.

To ensure the robustness of our analysis, we collected our
data twice at a distance of two months. While there were
differences in the numbers (for example Yahoo increased the
size of its index significantly in the meanwhile), there were
no differences in the scaling relations. We also collected
in-degree data using Google [12], again yielding different
numbers but the same trend. The in-degree measures exclude
links from the same site. For example, to find the in-degree
for http://informatics.indiana.edu/ we would submit the
query "link:http://informatics.indiana.edu/ -site:informatics.indiana.edu".
Note that the in-degree
data provided by search engines is only an estimate of the
true number. First, a search engine can only know of links
from pages that it has crawled and indexed. Second, for
performance reasons, the algorithms counting inlinks use
various unpublished approximations based on sampling.

Figure 4: A. Density plot of traffic versus in-degree
for a sample of 28,164 Web sites. Colors represent
the fraction of sites in each log-size bin, on a logarithmic
color scale. A few sites with highest in-degree
and/or traffic are highlighted. The source of
in-degree data is Yahoo [32]; using Google [12] yields
the same trend. Traffic is measured as the fraction
of all page views in a three-month period, according
to Alexa data. B. Relationship between average
traffic and in-degree obtained with logarithmic binning
of in-degree. The power-law predictions of the
surfing and searching models discussed in the text
are also shown.

Traffic is measured as page views per million in a three-month
period. Alexa collects and aggregates historical traffic
data from millions of Alexa Toolbar users. Page views
measure the number of pages viewed by these users. Multiple
page views of the same page made by the same user on
the same day are counted only once. Our measure of traffic t
corresponds to Alexa's count, divided by $10^6$ to express the
fraction of all the page views by toolbar users that go to a
particular site. Since traffic data is only available for Web sites
rather than single pages, we correlate the traffic of a site with
the in-degree of its main page. For example, suppose that we
want the traffic for http://informatics.indiana.edu/. Alexa
reports the 3-month average traffic of the domain indiana.edu
as 9.1 page views per million. Further, Alexa reports that
2% of the page views in this domain go to the
informatics.indiana.edu subdomain. Thus we reach the
estimate of 0.182 page views per million.
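The worked example above amounts to one multiplication and one rescaling. A minimal sketch follows; the function name and return convention are hypothetical, and only the numbers 9.1 and 2% come from the text.

```python
def site_traffic(domain_views_per_million, subdomain_share):
    """Return (page views per million, fraction of all page views) for a subdomain."""
    per_million = domain_views_per_million * subdomain_share
    fraction = per_million / 1e6  # traffic t as a fraction of all page views
    return per_million, fraction


# indiana.edu at 9.1 page views per million, 2% of which go to the subdomain.
per_million, t = site_traffic(9.1, 0.02)
```

The first returned value reproduces the 0.182 page views per million quoted in the text; the second is the corresponding traffic t expressed as a fraction.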

To derive a scaling relation, we average traffic along logarithmic
bins for in-degree, as shown in Fig. 4B. Surprisingly,
both the searching and surfing models fail to match the observed
scaling, which is not modeled well by a power law.
Contrary to our expectation, the scaling relation is sublinear,
suggesting that search engines actually have an egalitarian
effect, directing more traffic than expected to less popular
sites, i.e., those having lower PageRank and fewer links
to them. Search engines thus have the effect of counteracting
the skewed distribution of links in the Web, directing
some traffic toward sites that users would never visit otherwise.
This result is at odds with the previous theoretical
discussion; in order to understand the empirical data, we
need to include a neglected but basic feature of the Web:
the semantic match between queries and page content.
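Averaging along logarithmic bins can be sketched as follows. The bin width (`bins_per_decade`), the use of the geometric bin center, and the handling of pages with zero in-degree are assumptions, since the text does not specify them; the sample values are made up for illustration.

```python
import math
from collections import defaultdict


def log_binned_average(in_degrees, traffic, bins_per_decade=4):
    """Average traffic within logarithmic in-degree bins.

    Returns a list of (geometric bin center, mean traffic), sorted by bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for k, t in zip(in_degrees, traffic):
        if k <= 0:
            continue  # a logarithmic bin is undefined for zero in-degree
        b = math.floor(math.log10(k) * bins_per_decade)
        sums[b] += t
        counts[b] += 1
    binned = []
    for b in sorted(sums):
        center = 10 ** ((b + 0.5) / bins_per_decade)
        binned.append((center, sums[b] / counts[b]))
    return binned


# Made-up sample with one bin per decade of in-degree.
sample_k = [1, 2, 3, 10, 20, 30, 100, 200]
sample_t = [1, 2, 3, 4, 5, 6, 7, 9]
binned = log_binned_average(sample_k, sample_t, bins_per_decade=1)
```

Each bin contributes a single point to the log-log plot, which suppresses the large fluctuations among individual sites and makes the overall trend visible.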

4. QUERIES AND HIT SET SIZE 

In the previous theoretical estimate of traffic as driven by 
search engines, we considered the global rank of a page, com- 
puted across all pages indexed by the search engine. How- 
ever, any given query typically returns only a small number 
of pages compared to the total number indexed by the search 
engine. The size of the "hit" set and the nature of the query 
introduce a significant bias in the sampling process. If only 
a small fraction of pages are returned in response to a query, 
their rank within the set is not representative of their global 
rank as induced, say, by PageRank. 

Let us assume that all query result lists derive from a
Bernoulli process such that the number of hits relevant to
each query is on average hN, where h is the relative hit set
size. In Appendix B we show that this assumption leads
to an alteration in the relationship between traffic and
in-degree. To illustrate this effect, Fig. 5A shows how the click
probability changes with h. The result $t \sim k^{\gamma}$ (or $t \sim r^{-\alpha}$,
cf. Fig. 3) only holds in the limit case $h \to 1$. Since the size
of the hit sets is not fixed, but depends on user queries, we
measured the distribution of hit set sizes for actual AltaVista
queries as shown in Fig. 5B, yielding $\Pr(h) \sim h^{-\delta}$, with
$\delta \approx 1.1$ over seven orders of magnitude. The exponential
cutoff in the distribution of h is due to the maximum size
$h_M$ of actual hit lists corresponding to non-noise terms, and
thus can be disregarded for our analysis.

The traffic behavior is therefore a convolution of the different
curves reported in Fig. 5A, weighted by Pr(h). The final
relation between traffic and degree can thus be obtained by
numerical techniques (see Appendix B) and, strikingly, the
resulting behavior reproduces the empirical data over four
orders of magnitude, including the peculiar saturation observed
for high-traffic sites (Fig. 5C). Most importantly, the
theoretical behavior predicts a traffic increase for pages with
increasing in-degree that is noticeably slower than the
predictions of both the surfing and searching models. In other
words, the combination of search engines, the semantic
attributes of queries, and users' own behavior mitigates the
rich-get-richer dynamics of the Web, providing low-degree
pages with increased visibility.

Of course, actual Web traffic is a combination of both
surfing and searching behaviors. Users rely on search engines
heavily, but also navigate from page to page through static
links as they explore the neighborhoods of pages returned in
response to search queries. It would be easy to model a
mix of our revised searching model (taking into account the
more realistic distribution of hit set sizes) with the random
surfing behavior. The resulting mixture model would yield
a prediction somewhere between the linear scaling t ~ k of
the surfing model (cf. Fig. 4B) and the sublinear scaling of
our searching model (cf. Fig. 5C). The final curve would
be sublinear and still in agreement with the empirical traffic
data.

Figure 5: A. Scaling relationship between traffic and
in-degree when each page has a fixed probability h
of being returned in response to a query. The curves
(not normalized for visualization purposes) are obtained
by simulating the process t[r(k), h] (see Appendix B).
B. Distribution of relative hit set size h
for 200,000 actual user queries from AltaVista logs.
The hit set size data were obtained from Google [12].
Frequencies are normalized by logarithmic bin size.
The log-log plot shows a power law with an exponential
cutoff. C. Scaling between traffic and in-degree
obtained by simulating 4.5 million queries with a
realistic distribution of hit set size on a one-million
node network. Empirical data from Fig. 4B.

5. DISCUSSION AND OUTLOOK 

Our heavy reliance on search engines as a means of coping 
with the Web's size and growth does affect how we discover, 
link to, and visit pages. However, in spite of the rich-get- 
richer dynamics implicitly contained in the use of link anal- 
ysis to rank search hits, the net effect of search engines on 
traffic appears to produce an egalitarian effect, smearing 
out the traffic attraction of high-degree pages. Our empirical
data clearly show a sublinear scaling relation between
referral traffic from search engines and page in-degree. This
seems to be in agreement with the observation that search
engines lead users to visit about 20% more pages than
surfing alone [29]. Such an effect may be understood within
a theoretical model of information retrieval that considers 
the users' clicking behavior and the heavy-tailed distribu- 
tion observed for the number of query hits. 

This result has relevant conceptual and practical conse- 
quences. It suggests that, contrary to intuition and prior 
hypotheses, the use of search engines contributes to a more 
level playing field, in which new Web sites have a greater 
chance of being discovered and thus of acquiring links and 
popularity — as long as they are about specific topics that 
match the interests of users as expressed through their search 
queries. 

Such a finding is particularly relevant for the design of 
realistic models for Web growth. The connection between 
the popularity of a page and its acquisition of new links 
has led to the well-known rich-get-richer growth paradigm 
that explains many of the observed topological features of 
the Web. The present findings, however, show that several 
non-linear mechanisms involving search engine algorithms 
and user behavior regulate the popularity of pages. This 
calls for a new theoretical framework that considers more of 
the various behavioral and semantic issues that shape the 
evolution of the Web. How such a framework may yield 
coherent models that still agree with the Web's observed 
topological properties is a difficult and important theoretical 
challenge. 

Finally, the present results provide a first quantitative es- 
timate of, and prediction for, the popularity and traffic gen- 
erated by Web pages. This estimate promises to become an 
important tool to be exploited in the optimization of mar- 
keting campaigns, the generation of traffic forecasts, and the 
design of future search engines. 
APPENDIX

A. RELATIONSHIP BETWEEN 
IN-DEGREE AND PAGERANK 

Let us inspect the scaling relationship between in-degree 
k and PageRank p. In our calculations of PageRank we 
used a damping factor 0.85, as in the original version of 
the algorithm [7] and in many successive studies. Our numerical
analysis of the PageRank for the Web graph was
performed on two samples produced by crawls made in 2001
and 2003 by the WebBase collaboration at Stanford [30].
The graphs are quite large: the former crawl has 80,571,247 
pages and 752,527,660 links; the latter has 49,296,313 pages 
and 1,185,396,953 links. 

In Fig. 6, in order to reduce fluctuations, we averaged the
PageRank values over logarithmic bins of the degree. The
data points mostly fall on a power law curve for both samples,
with p increasing with k. The correlation coefficients
of the two sets of data, before binning, are 0.54 and 0.48
for the 2001 and 2003 crawl, respectively, as found for the
Web domain of the University of Notre Dame [25], but in
disagreement with the results of an analysis on the domain
of Brown University and the WT10g Web snapshot [27].
The estimated exponents of the power law fits for the two
curves are 1.1 ± 0.1 (2001) and 0.9 ± 0.1 (2003). As shown in
Fig. 6, the two estimates are compatible with a simple linear
relation between PageRank and in-degree. A linear scaling
relation between p and k is also consistent with the observation
that both have the same distribution. As it turns out,
p and k are both distributed according to a power law with
estimated exponent -2.1 ± 0.1, in agreement with other
estimates [27, 11]. We assume, therefore, that PageRank
and in-degree are, on average, proportional for large values.

Figure 6: PageRank as a function of in-degree k for
two samples of the Web taken in 2001 and 2003 [30].
Legend: WebBase 2003 crawl, WebBase 2001 crawl, and
the linear relation p ~ k.

B. SIMULATION OF SEARCH-DRIVEN 
WEB TRAFFIC 

When a user submits a query to a search engine, the lat- 
ter will select all pages deemed relevant from its index and 
display the corresponding links ranked according to a com- 
bination of query-dependent factors, such as the similarity 
between the terms in the query and those in the page ti- 
tle, and query-independent prestige factors such as PageR- 
ank. Here we focus on PageRank as the main global ranking 
factor, assuming that query-dependent factors are averaged 
out across queries. The number of hit results depends on 
the query and it is in general much smaller than the total 
number of pages indexed by the search engine. 

Let us start from the relation between click probability
and rank in Eq. 3. If all N pages in the index were listed in
each query, as implicitly assumed in ref. [9], the probability
for the page with the smallest PageRank to be clicked would
be $N^{\alpha}$ ($\alpha \approx 1.63$ in our study) times smaller than the
probability to click on the page with the largest PageRank. If
instead both pages ranked first and N-th appear among the
n hits of a realistic query (with $n \ll N$), they would still
occupy the first and the last positions of the hit list, but the
ratio of their click probabilities would be much smaller than
before, i.e. $n^{\alpha}$. This leads to a redistribution of the clicking
probability in favor of the less "popular" pages, which are
then visited much more often than one would expect at first
glance. To quantify this effect, we must first distinguish between
the global rank induced by PageRank across all Web
pages and the query-dependent rank among the hits returned
by the search engine in response to a particular query. Let
us rank all N pages in decreasing order of PageRank, such
that the global rank is R = 1 for the page with the largest
PageRank, followed by R = 2 and so on.
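To get a feeling for the size of this redistribution, one can compare the two ratios for concrete numbers. The index size and hit-list length below are hypothetical round figures chosen for illustration; only the exponent 1.63 comes from the text.

```python
def click_ratio(list_length, alpha=1.63):
    """Ratio of click probabilities between the first and the last hit of a list,
    assuming click probability decays as rank^(-alpha)."""
    return list_length ** alpha


# Hypothetical sizes: a one-million-page index versus a 100-hit query result.
ratio_full_index = click_ratio(10**6)
ratio_hit_list = click_ratio(100)
reduction = ratio_full_index / ratio_hit_list
```

With these illustrative sizes the disadvantage of the least popular page shrinks by several orders of magnitude once it competes only against the other hits of the query rather than against the whole index.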

Let us assume for the moment that all query result lists 
derive from a Bernoulli process with success probability h 
(i.e., the number of hits relevant to each query is on average 
hN). The assumption that each page can appear in the hit 
list with the same probability h is in general not true, as 
there are pages that are more likely to be relevant than oth- 
ers, depending on their size, intrinsic appeal, and so on. If 
one introduces a fitness parameter to modulate the probabil- 
ity for a page to be relevant with respect to a generic query, 
the results would be identical as long as the fitness is not 
correlated with the PageRank of the page. In what follows 
we then stick to the simple assumption of equiprobability. 

Let us calculate the probability Pr(R, r, N, n, h) that the
page with global rank R has rank r within a list of n hits.
This is the probability $p^{R-1}_{r-1}$ to select r - 1 pages from the
set {1 ... R - 1}:

$$p^{R-1}_{r-1} = \binom{R-1}{r-1} h^{r-1} (1-h)^{R-r} \tag{5}$$

times the probability $p^{N-R}_{n-r}$ to select n - r pages from the
set {R + 1 ... N}, times the probability h to select page R.
So we obtain:

$$\Pr(R, r, N, n, h) = \binom{R-1}{r-1} \binom{N-R}{n-r} h^{n} (1-h)^{N-n} \tag{6}$$

If page R has rank r in a list of n hits, the probability of
being clicked will be

$$t(R, r, N, n, h) = \frac{r^{-\alpha}}{\sum_{m=1}^{n} m^{-\alpha}} \Pr(R, r, N, n, h) \tag{7}$$

where the denominator ensures the proper normalization of
the click probability within the hit list. What remains to
be done is to sum over the possible ranks r of page R in
the hit list ($r \in 1 \ldots n$) and over all possible hit set sizes
($n \in 1 \ldots N$). The final result for the probability t(R, N, h)
of the R-th page to be clicked is:

$$t(R, N, h) = \sum_{n=1}^{N} \sum_{r=1}^{n} \frac{r^{-\alpha}}{\sum_{m=1}^{n} m^{-\alpha}} \binom{R-1}{r-1} \binom{N-R}{n-r} h^{n} (1-h)^{N-n} \tag{8}$$

From Eq. 8 we can see that if h = 1, which corresponds to
a list with all N pages, one recovers Eq. 3, as expected. For
h < 1, however, it is not possible to derive a closed-form expression
for t(R, N, h), so one has to calculate the binomials and
perform the sums numerically. This can be easily done,
but the time required to perform the calculation increases
dramatically with N, so that it is not realistic to push the
computation beyond $N = 10^4$. For this reason, instead of
carrying out an exact calculation, we performed Monte Carlo
simulations of the process leading to Eq. 8.

Figure 7: Scaling of t(R, N, h)/h with the variable
Rh. The three curves refer to a sample of N = 10° pages.
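For small N the direct summation is easy to write down. The sketch below follows the construction described above: for every hit-list size n and in-list rank r, the normalized click probability r^(-alpha) is weighted by the probability that page R is selected together with exactly r - 1 higher-ranked and n - r lower-ranked pages. The function name and the small test sizes are hypothetical, and this is an illustration rather than the code used in the study.

```python
from math import comb


def click_probability_exact(R, N, h, alpha=1.63):
    """Probability that the page of global rank R (1-based) is clicked,
    summed over hit-list sizes n and in-list ranks r."""
    total = 0.0
    for n in range(1, N + 1):
        # Probability of one specific hit list with n pages out of N.
        weight = h ** n * (1.0 - h) ** (N - n)
        if weight == 0.0:
            continue
        norm = sum(m ** -alpha for m in range(1, n + 1))
        for r in range(1, min(n, R) + 1):
            # Number of hit lists of size n in which page R has rank r.
            ways = comb(R - 1, r - 1) * comb(N - R, n - r)
            if ways:
                total += (r ** -alpha / norm) * weight * ways
    return total


# With h = 1 every page is listed, so the in-list rank equals the global rank.
full_list = click_probability_exact(R=5, N=20, h=1.0)

# Summed over all pages, the click probability equals the chance of a non-empty list.
total_mass = sum(click_probability_exact(R, 12, 0.3) for R in range(1, 13))
```

The two checks at the end are consistency tests: the first recovers the pure power law in rank for h = 1, and the second confirms that click probability is conserved within each non-empty hit list. The nested sums make the cost grow quickly with N, which is the practical limitation mentioned in the text.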

In each simulation we produce a large number of hit lists,
where every list is formed by picking each page of the sample
with probability h. At the beginning of the simulation
we initialize all entries of the array t(R, N, h) = 0. Once a
hit list is completed, we add to the entries of t(R, N, h)
corresponding to the pages of the hit list the click probability
as given by Eq. 3 (with the proper normalization). With
this Monte Carlo method we simulated systems with up to
$N = 10^6$ items. To eliminate fluctuations we averaged the
click probability in logarithmic bins, as already done for the
experimental data.
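The procedure just described can be sketched in a few lines. This is a minimal illustration, not the simulation code used for the figures: the function name, the seed, the system size, and the number of hit lists are assumptions, and the final averaging over the number of lists is one possible normalization.

```python
import random


def simulate_click_probability(N, h, num_lists, alpha=1.63, seed=0):
    """Monte Carlo estimate of t(R, N, h) for R = 1..N.

    Returns (t, nonempty): t[R - 1] is the estimated click probability of the
    page with global rank R, and nonempty counts the non-empty hit lists.
    """
    rng = random.Random(seed)
    t = [0.0] * N
    nonempty = 0
    for _ in range(num_lists):
        # Pages are indexed in decreasing PageRank order, so scanning them in
        # order yields a hit list that is already sorted by rank.
        hits = [page for page in range(N) if rng.random() < h]
        if not hits:
            continue
        nonempty += 1
        norm = sum((r + 1) ** -alpha for r in range(len(hits)))
        for r, page in enumerate(hits):
            # Normalized power-law click probability for in-list rank r + 1.
            t[page] += (r + 1) ** -alpha / norm
    return [x / num_lists for x in t], nonempty


# Small illustrative run: 200 pages, about 10 hits per list on average.
t_est, nonempty = simulate_click_probability(N=200, h=0.05, num_lists=2000)
```

Because the click probabilities within each non-empty list sum to one, the estimates add up to the fraction of non-empty lists, which gives a simple sanity check on the implementation. The cost per list is linear in N, in contrast with the nested sums of the exact calculation.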

We found that the function t(R, N, h) obeys a simple scaling
law:

t(R, N, h) = h F(Rh) A(N)    (9)

where F(Rh) has the following form:

F(Rh) = const    if h ≤ Rh ≤ 1
        [...]    if Rh > 1        (10)



An immediate implication of Eq. 9 is that if one plots
t(R,N,h)/h as a function of Rh, for N fixed, one obtains
the same curve F(Rh)A(N), independently of the value of
h (Fig. 7).

The decreasing part of the curve t(R,N,h), for Rh > 1,
i.e., R > 1/h, is the same as in the case when h = 1 (Eq. 3).
This means that the finite size of the hit list affects only the
top-ranked 1/h pages. The effect is thus strongest when the
fraction h is small, i.e., for specific queries that return few
hits. The striking feature of Eq. 10 is the plateau for all
pages between the first and the 1/h-th. This implies that
the difference in the values of PageRank among the top 1/h
pages does not produce a difference in the probability of
clicking on those pages. For h = 1/N, which would correspond
to lists containing on average a single hit, each of the
N pages would have the same probability of being clicked,
regardless of their PageRank. This is not surprising, as we
assumed that all pages have the same probability to appear
in a hit list.

So far we assumed that the number of query results is
drawn from a binomial distribution with a mean of hN
hits. On the other hand, we know that real queries generate
a broad range of possible hit set sizes, going from lists
with only a single result to lists containing tens of millions
of results. If the size of the hit list is distributed according
to some function S(h,N), one would need to convolve
t(R, N, h) with S(h, N) to get the corresponding click
probability:



t_S(R, N) = ∫_{h_m}^{h_M} S(h, N) t(R, N, h) dh    (11)



where h_m and h_M are the minimal and maximal fraction
of pages in a list, respectively. We stress that if there is a
maximal hit list size h_M < 1, each curve t(R,N,h) in the
superposition will have a flat portion going from the first to the
1/h_M-th page, so in the set of pages ranked between 1 and
1/h_M the click probability will be flat, independently of the
distribution function S(h,N).

We obtained the hit list size distribution from a log of 
200,000 actual queries submitted to AltaVista in 2001 
(Fig. 6). The data can be reasonably well fitted by a power
law with an exponential cutoff due to the finite size of the 
AltaVista index. The exponent of the power law is δ ≈ 1.1.
In our Monte Carlo simulations we neglected the exponential 
cutoff, and used the simple power law 



S(h, N) = B(N) h^{-δ}    (12)



where the normalization constant B(N) is just a function of
N. The cutoff would affect only the part of the distribution
S(h, N) corresponding to the largest values of h, influencing
a limited portion of the curve t_S(R,N) and the click
probability of the very top pages (cf. the scaling relation
of Eq. 9). As there are no real queries that return hit
lists containing all pages,^1 we have that h_M < 1. To estimate
h_M we divided the largest observed number of Google
hits in our collection of AltaVista queries (approximately
6.6 x 10^8) by the total number of pages reportedly indexed
by Google (approximately 8 x 10^9 as of this writing), yielding
h_M ≈ 0.1. The top-ranked 1/h_M ≈ 10 sites will have the
same probability to be clicked. We then expect a flattening
of the portion of t_S(R, N) corresponding to the pages with
the highest PageRank/in-degree. This flattening seems consistent
with the pattern observed in the real data (Fig. 5).
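The estimate of the maximal fraction and the power-law distribution of hit list sizes can be illustrated with a short sketch. It is a hedged example rather than the procedure actually used: it draws h by inverse-transform sampling from a pure power law with exponent δ = 1.1 between a minimal and a maximal fraction, ignoring the exponential cutoff as the text does. The function name sample_hit_fraction and the variable names are hypothetical.

```python
import random


def sample_hit_fraction(h_min, h_max, delta=1.1, rng=random):
    """Draw h from S(h) ~ h^(-delta) on [h_min, h_max] by inverse transform."""
    a = 1.0 - delta                       # nonzero as long as delta != 1
    lo, hi = h_min ** a, h_max ** a
    u = rng.random()
    # Invert the cumulative distribution (h^a - lo) / (hi - lo) = u.
    return (lo + u * (hi - lo)) ** (1.0 / a)


# Estimate of the maximal fraction from the figures quoted in the text.
h_max = 6.6e8 / 8e9         # about 0.08, rounded to 0.1 in the text
plateau = 1.0 / h_max       # about 12, i.e. roughly the top 10 pages

if __name__ == "__main__":
    rng = random.Random(1)
    N = 10 ** 6
    sample = [sample_hit_fraction(1.0 / N, h_max, rng=rng) for _ in range(5)]
    print(h_max, plateau, sample)
```

Because δ is close to 1, the sampled fractions spread over many orders of magnitude, with most draws much smaller than the maximal fraction.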

^1 The policy of all search engines is to display at most 1000
hits, and we took this into account in our simulations. This
does not mean that h < 1000/N; the search engine scans
its entire database and can report millions of hits, but it will
finally display only the top 1000.

Figure 8: Scaling of t_S(R,N) for N = 10^4, 10^5, 10^6.
The click probability t is multiplied for each curve
by a number f(N) that depends only on N. In the
limit N → ∞, f(N) ~ N.

As to the full shape of the curve t_S(R,N) for the Web,
we performed a simulation for a set of N = 10^6 pages. We
used h_m = 1/N, as there are hit lists with a few or even a
single result. The size of our sample is still very far from the
total number of pages of the Web, so in principle we could
not match the curve derived from the simulation with the
pattern of the real data. However, the theoretical curves
obey a simple scaling relation, as we can see in Fig. 8. It is
indeed possible to prove that t_S(R,N) is a function of the
'normalized' rank R/N (and of N) and not of the absolute
rank R. On a log-log scale, this means that by properly
shifting curves obtained for different N values along the x
and y axes it is possible to make them overlap, exactly as
we see in Fig. 8. This allows us to safely extrapolate to the
limit of much larger N, and to lay the curve derived by our
simulation on the empirical data (as we did in Fig. 5). The
argument is rather simple, and is based on the ansatz of
Eq. 9 for the function t(R,N,h) and the power law form of
the distribution S(h,N) (Eq. 12). If we perform the convolution
of Eq. 11 we have

t_S(R, N) = ∫_{1/N}^{h_M} S(h, N) h A(N) F(Rh) dh,    (13)

where we explicitly set h_m = 1/N and F(Rh) is the universal
function of Eq. 10. By plugging the explicit expression of
S(h,N) from Eq. 12 into Eq. 13 and performing the simple
change of variable z = hN within the integral we obtain



t_S(R, N) = [A(N) B(N) / N^{2-δ}] ∫_{1}^{h_M N} z^{1-δ} F((R/N) z) dz.    (14)



The upper integration limit can be safely set to infinity because
h_M N is very large. The integral in Eq. 14 thus becomes
a function of the ratio R/N. The additional explicit
dependence on N, expressed by the term outside the integral,
consists of a simple multiplicative factor f(N) that
does not affect the shape of the curve (cf. Fig. 8).
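The scaling argument can be checked numerically with a small sketch. It integrates the convolution in the original variable h, multiplies the result by N^(2-δ), and compares values at equal R/N for different N. The assumptions are loud ones: A(N) and B(N) are set to 1, the scaling function is taken to be flat up to Rh = 1 and to decay as a power law with a placeholder exponent alpha = 1.6 beyond it, and h_M = 0.1 and δ = 1.1 are taken from the text. All function names are hypothetical.

```python
import math


def F(x, alpha=1.6):
    """ASSUMED scaling function: flat up to x = 1, power-law decay beyond."""
    return 1.0 if x <= 1.0 else x ** (-alpha)


def t_S_unnormalized(R, N, h_max=0.1, delta=1.1, steps=20000):
    """Integrate h^(1 - delta) F(R h) over [1/N, h_max], with A(N) = B(N) = 1.

    Uses the midpoint rule on a logarithmic grid (dh = h d(ln h)).
    """
    log_lo, log_hi = math.log(1.0 / N), math.log(h_max)
    step = (log_hi - log_lo) / steps
    total = 0.0
    for i in range(steps):
        h = math.exp(log_lo + (i + 0.5) * step)
        total += h ** (1.0 - delta) * F(R * h) * h * step
    return total


def rescaled(R, N, delta=1.1):
    """Multiply by N^(2 - delta); the result should depend mainly on R/N."""
    return t_S_unnormalized(R, N, delta=delta) * N ** (2.0 - delta)


if __name__ == "__main__":
    for N in (10 ** 4, 10 ** 5, 10 ** 6):
        print(N, rescaled(N // 2, N))     # same R/N = 0.5 for every N
```

The rescaled values agree to within about a percent in this setup. The small residual difference comes from the finite upper limit h_M N of the integral, which the text replaces by infinity.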

We finally remark that the expression t_S(R,N) that we
derived by simulation represents the relation between the
click probability and the global rank of a page as determined
by the value of its PageRank. For a comparison with
the empirical data of Fig. 5 we need a relation between
click probability and in-degree. We can relate rank to
in-degree by means of the equation relating rank and PageRank
and by exploiting the proportionality between PageRank and
in-degree discussed earlier.

However, neither the equation relating rank and PageRank nor
the proportionality between p and k is rigorous; both hold
only in the asymptotic regime of low rank/large in-degree.
If it were feasible to simulate queries on a Web graph with
O(10^10) nodes, the theoretical curve in Fig. 5 would extend
over the entire range of the x-axis. In this case the low-k
part of the curve would have to be adjusted to account for
the flattening observed in the figure that displays the relation
between PageRank and in-degree. The leftmost part of this
curve is quite flat for over one order of magnitude, giving a
plausible explanation for the flat pattern of the low-k data
in Fig. 5.



